Abstract

Replicated services accessed via quorums enable each access to be performed at only a
subset (quorum) of the servers, and achieve consistency across accesses by requiring any two
quorums to intersect. Recently, b-masking quorum systems, whose intersections contain at
least 2b+1 servers, have been proposed to construct replicated services tolerant of b arbitrary
(Byzantine) server failures. In this paper we consider a hybrid fault model allowing benign
failures in addition to the Byzantine ones. We present four novel constructions for b-masking
quorum systems in this model, each of which has optimal load (the probability of access of
the busiest server) or optimal availability (probability of some quorum surviving failures).
To show optimality we also prove lower bounds on the load and availability of any b-masking
quorum system in this model.

1 Introduction

Quorum systems are well known tools for increasing the efficiency of replicated services, as well as
their availability when servers may fail benignly. A quorum system is a set of subsets (quorums)
of servers, every pair of which intersect. Quorum systems enable each client operation to be
performed only at a quorum of the servers, while the intersection property makes it possible to
preserve consistency among operations at the service.

Quorum systems work well for environments where servers may fail benignly. However, when
servers may suffer arbitrary (Byzantine) failures, the intersection property does not suffice for
maintaining consistency; two quorums may intersect in a subset containing faulty servers only,
which may deviate arbitrarily and undetectably from their assigned protocol. Malkhi and Reiter
thus introduced masking quorum systems [MR98a], in which each pair of quorums intersects in
sufficiently many servers to mask out the behavior of faulty servers. More precisely, a b-masking
quorum system is one in which any two quorums intersect in at least $2b+1$ servers, which suffices to
ensure consistency in the system if at most $b$ servers suffer Byzantine failures.

In this paper we develop four new constructions for b-masking quorum systems. For the
first time in this context, we distinguish between masking Byzantine faults and surviving a
possibly larger number of benign faults. Our systems remain available in the face of any $f$
crashes, where $f$ may be significantly larger than $b$ (such a system is called f-resilient). In
addition, our constructions demonstrate optimality (ignoring constants) in two widely accepted
measures of quorum systems, namely load and crash probability. The load ($\mathcal{L}$), a measure of
best-case performance of the quorum system, is the probability with which the busiest server
is accessed under the best possible strategy for accessing quorums. The crash probability ($F_p$)
is the probability, assuming that each server crashes with independent probability $p$, that all
quorums in the system will contain at least one crashed server (and thus will be unavailable).
The crash probability is an even more refined measure of availability than $f$, as a good system
will tolerate many failure configurations with more than $f$ crashes. Three of our systems are
the first systems to demonstrate optimal load for b-masking quorum systems, and two of our
systems each demonstrate optimal crash probability for its resilience $f$. In proving optimality
of our constructions, we prove new lower bounds for the load and crash probability of masking
quorum systems.

The techniques for achieving our constructions are of interest in themselves. Two of the 
constructions are achieved using a boosting technique, which can transform any regular (i.e., 
benign fault-tolerant) quorum system into a masking quorum system of an appropriately larger 
system. Thus, it makes all known quorum constructions available for Byzantine environments 
(of appropriate sizes). In the analysis of one of our best systems we employ strong results from 
percolation theory. 

The rest of this paper is structured as follows. We review related work and preliminary
definitions in Sections 2 and 3, respectively. In Section 4 we prove bounds on the load and crash
probability for b-masking quorum systems and introduce quorum composition. In Sections 5-7
we describe our new constructions. We discuss our results in Section 8.



2 Related work 



Our work borrows from extensive prior work in benignly fault-tolerant quorum systems (e.g.,
[Gif79, Tho79, Mae85, GB85, Her86, BG87, ET89, AE91, CAA92, NW98, PW97b]). The notion of
availability we use here (crash probability) is well known in reliability theory [BP75] and has
been applied extensively in the analysis of quorum systems (cf. [BG87, PW95, PW97a] and the
references therein). The load of a quorum system was first defined and analyzed in [NW98],
which proved a lower bound of $\Omega(1/\sqrt{n})$ on the load of any quorum system (and, a fortiori, any
masking quorum system) over $n$ servers. In proving load-optimality of our constructions, we
generalize this lower bound to $\Omega(\sqrt{b/n})$ for b-masking quorum systems.

  Symbol                         Meaning
  $b$                            Maximum number of Byzantine server failures.
  $c(\mathcal{Q})$               Size of the smallest quorum in $\mathcal{Q}$.
  $f$                            Resilience (Definition 3.4).
  $F_p$                          Crash probability (Definition 3.10).
  $\mathcal{IS}(\mathcal{Q})$    Size of smallest intersection between any two quorums in $\mathcal{Q}$.
  $\mathcal{L}(\mathcal{Q})$     Load of $\mathcal{Q}$ (Definition 3.8).
  $\mathcal{MT}(\mathcal{Q})$    Size of a smallest transversal of $\mathcal{Q}$ (Definition 3.3).
  $n$                            Number of servers (i.e., $|U| = n$).
  $p$                            Independent probability that each server crashes.
  $\mathcal{Q}$                  A quorum system (Definition 3.1).
  $U$                            Universe of servers.

Table 1: The notation used in this paper.



Grids, which form the basis for our M-Grid construction, were proposed in [Mae85, CAA92,
KRS93, MR98a]. The technique of quorum composition, which we use in our RT and boostFPP
constructions, has been studied in [MP92, NM92, Nei92] under various names such as "coterie
join" and "recursive majority". Our M-Path construction generalizes the system of [WB92],
coupled with the analysis of the Paths construction of [NW98], and the recent system of [Baz96].

Several constructions of masking quorum systems were given in [MR98a] for a variety of
failure models. For the model we consider here (i.e., any $b$ servers may experience Byzantine
failures), that work gave two constructions. We compare those constructions to ours in Section 8.

Hybrid failure models have been considered in other works (e.g., [GP92, LR95, LR94, RB94]).



3 Preliminaries 

In this section we introduce notation and definitions used in the remainder of the paper. Much
of the notation introduced in this section is summarized in Table 1 for quick reference.

We assume a universe $U$ of servers, $|U| = n$, over which our quorum systems will be
constructed. Servers that obey their specifications are correct. A faulty server, however, may deviate
from its specification arbitrarily. We assume that up to $b$ servers may fail arbitrarily and that
$4b < n$, since this is necessary for a b-masking quorum system to exist [MR98a]. Beginning in
Section 3.2.2, we will also distinguish benign (crash) failures as a particular failure of interest,
and in general there may be more than $b$ such failures.

3.1 Quorum systems

Definition 3.1 A quorum system $\mathcal{Q} \subseteq 2^U$ is a collection of subsets of $U$, each pair of which
intersect. Each $Q \in \mathcal{Q}$ is called a quorum.

We use the following notation. The cardinality of the smallest quorum in $\mathcal{Q}$ is denoted by
$c(\mathcal{Q}) = \min\{|Q| : Q \in \mathcal{Q}\}$. The size of the smallest intersection between any two quorums is
denoted by $\mathcal{IS}(\mathcal{Q}) = \min\{|Q \cap R| : Q, R \in \mathcal{Q}\}$. The degree of an element $i \in U$ in a quorum
system $\mathcal{Q}$ is the number of quorums that contain $i$: $\deg(i) = |\{Q \in \mathcal{Q} : i \in Q\}|$.

Definition 3.2 A quorum system $\mathcal{Q}$ is $(s,d)$-fair if $|Q| = s$ for all $Q \in \mathcal{Q}$ and $\deg(i) = d$ for
all $i \in U$. $\mathcal{Q}$ is called fair if it is $(s,d)$-fair for some $s$ and $d$.

Definition 3.3 A set $T$ is a transversal of a quorum system $\mathcal{Q}$ if $T \cap Q \neq \emptyset$ for every $Q \in \mathcal{Q}$.
The cardinality of the smallest transversal is denoted by $\mathcal{MT}(\mathcal{Q}) = \min\{|T| : T \text{ is a transversal of } \mathcal{Q}\}$.
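
The quantities just defined can be checked by exhaustive search on very small systems. The following Python sketch is only an illustration: the function names and the 2-of-3 majority example are assumptions introduced here, not part of the paper, and the transversal search is exponential in the universe size.

```python
from itertools import combinations, product

def smallest_quorum(quorums):
    """c(Q): cardinality of the smallest quorum."""
    return min(len(q) for q in quorums)

def smallest_intersection(quorums):
    """IS(Q): smallest intersection over all pairs of quorums."""
    return min(len(q & r) for q, r in product(quorums, repeat=2))

def degree(quorums, i):
    """deg(i): number of quorums that contain element i."""
    return sum(1 for q in quorums if i in q)

def smallest_transversal(universe, quorums):
    """MT(Q): size of a smallest set that hits every quorum (brute force)."""
    for size in range(1, len(universe) + 1):
        for t in combinations(sorted(universe), size):
            if all(q & set(t) for q in quorums):
                return size
    raise ValueError("no transversal exists")  # only if some quorum is empty

# Hypothetical example: the 2-of-3 majority system over servers 0, 1, 2.
universe = {0, 1, 2}
majority = [frozenset(q) for q in combinations(sorted(universe), 2)]
```

For this example every server has degree 2 and every quorum has size 2, so the system is (2,2)-fair in the sense of Definition 3.2.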

Regular quorum systems, with $\mathcal{IS}(\mathcal{Q}) = 1$, are insufficient to guarantee consistency in case
of Byzantine failures. Malkhi and Reiter [MR98a] defined several varieties of quorum systems
for Byzantine environments, which are suitable for different types of services. In this paper we
focus on masking quorum systems.



Definition 3.4 [MR98a] The resilience $f$ of a quorum system $\mathcal{Q}$ is the largest $k$ such that for
every set $K \subseteq U$, $|K| = k$, there exists $Q \in \mathcal{Q}$ such that $K \cap Q = \emptyset$.

Remark: The resilience of any quorum system $\mathcal{Q}$ is $f = \mathcal{MT}(\mathcal{Q}) - 1$.

Definition 3.5 [MR98a] A quorum system $\mathcal{Q}$ is a b-masking quorum system if it is resilient to
$f \ge b$ failures, and obeys the following consistency requirement:

$$\forall Q_1, Q_2 \in \mathcal{Q}: \; |Q_1 \cap Q_2| \ge 2b + 1. \qquad (1)$$

Remark: Informally, if we view the service as a shared variable which is updated and read by
the clients, then the resilience requirement of Definition 3.4 ensures that no set of $b \le f$ faulty
servers will be able to block update operations (e.g., by causing every update transaction to
abort). The consistency requirement of Definition 3.5 ensures that read operations can mask
out any faulty behavior of up to $b$ servers. Examples of protocols implementing various data
abstractions using b-masking quorum systems can be found in [MR98a, MR98b, MR98c].



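Lemma 3.6 Let $\mathcal{Q}$ be a quorum system. Then $\mathcal{Q}$ is b-masking if both the following conditions
hold:

1. $\mathcal{MT}(\mathcal{Q}) \ge b + 1$,
2. $\mathcal{IS}(\mathcal{Q}) \ge 2b + 1$.

Proof: Assume that $\mathcal{MT}(\mathcal{Q}) \ge b + 1$. To see that $\mathcal{Q}$ is resilient to $b$ failures, note that if there
exists some $K$ such that $K \cap Q \neq \emptyset$ for all $Q \in \mathcal{Q}$ then $K$ is a transversal. By the minimality
of $\mathcal{MT}(\mathcal{Q})$ we have $|K| \ge b + 1$, and we are done. Condition 2 immediately implies (1). □

Corollary 3.7 Let $\mathcal{Q}$ be a quorum system, and let $b = \min\{\mathcal{MT}(\mathcal{Q}) - 1, (\mathcal{IS}(\mathcal{Q}) - 1)/2\}$. Then $\mathcal{Q}$ is
b-masking. □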
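
As a small numerical illustration of Corollary 3.7, the sketch below computes the masking level of a threshold system by brute force. The helper names are assumptions made for this example, and the integer division reflects the convention of rounding the bound down to an integer.

```python
from itertools import combinations, product

def threshold_system(k, l):
    """All l-element subsets of the universe {0, ..., k-1}."""
    return [frozenset(q) for q in combinations(range(k), l)]

def masking_level(universe_size, quorums):
    """Largest b given by Corollary 3.7: min(MT - 1, floor((IS - 1) / 2)).

    Assumes the universe is {0, ..., universe_size - 1}.
    """
    smallest_is = min(len(q & r) for q, r in product(quorums, repeat=2))
    # Smallest transversal by exhaustive search over subsets of growing size.
    smallest_mt = next(
        size
        for size in range(1, universe_size + 1)
        for t in combinations(range(universe_size), size)
        if all(q & set(t) for q in quorums)
    )
    return min(smallest_mt - 1, (smallest_is - 1) // 2)
```

For instance, a 4-of-5 threshold system has intersections of size 3 and minimal transversals of size 2, so it is 1-masking, while a 3-of-4 system has intersections of size only 2 and masks no Byzantine failure.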
3.2 Measures



The goal of using quorum systems is to increase the availability of replicated services and decrease 
their access costs. A natural question is how well any particular quorum system achieves these 
goals, and moreover, how well it compares with other quorum systems. Several measures will 
be of interest to us. 



3.2.1 Load 

A measure of the inherent performance of a quorum system is its load. Naor and Wool define the
load of a quorum system as the frequency of accessing the busiest server using the best possible
strategy [NW98]. More precisely, given a quorum system $\mathcal{Q}$, an access strategy $w$ is a probability
distribution on the elements of $\mathcal{Q}$; i.e., $\sum_{Q \in \mathcal{Q}} w(Q) = 1$. The value $w(Q) \ge 0$ is the frequency
of choosing quorum $Q$ when the service is accessed. The load is then defined as follows:

Definition 3.8 Let a strategy $w$ be given for a quorum system $\mathcal{Q} = \{Q_1, \ldots, Q_m\}$ over a
universe $U$. For an element $u \in U$, the load induced by $w$ on $u$ is $l_w(u) = \sum_{Q_i \ni u} w(Q_i)$. The load
induced by a strategy $w$ on a quorum system $\mathcal{Q}$ is $\mathcal{L}_w(\mathcal{Q}) = \max_{u \in U} \{l_w(u)\}$. The system load on
a quorum system $\mathcal{Q}$ is $\mathcal{L}(\mathcal{Q}) = \min_w \{\mathcal{L}_w(\mathcal{Q})\}$, where the minimum is taken over all strategies.

We reiterate that the load is a best case definition. The load of the quorum system will be
achieved only if an optimal access strategy is used, and only in the case that no failures occur.
A strength of this definition is that the load is a property of a quorum system, and not of the
protocol using it. Examples of load calculations can be found in [Woo96]. As an aside, we note
that not every quorum system can have a strategy that induces the same load on each server.
In [HMP97] it is shown that for some quorum systems it is impossible to perfectly balance the
load.

Recall that $c(\mathcal{Q})$ denotes the cardinality of the smallest quorum in $\mathcal{Q}$. The following result
will be useful to us in the sequel (recall Definition 3.2).

Proposition 3.9 [NW98] Let $\mathcal{Q}$ be a fair quorum system. Then $\mathcal{L}(\mathcal{Q}) = c(\mathcal{Q})/n$.
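
To make Definition 3.8 concrete, the following sketch evaluates the load induced by a given strategy. The function name, the 2-of-3 majority example and the two strategies are assumptions chosen for illustration; exact fractions are used so the result can be compared directly with $c(\mathcal{Q})/n$.

```python
from fractions import Fraction
from itertools import combinations

def induced_load(universe, quorums, strategy):
    """L_w(Q): the largest per-server access frequency under strategy w.

    `strategy[i]` is the probability w(Q_i) of choosing `quorums[i]`.
    """
    return max(
        sum(w for q, w in zip(quorums, strategy) if u in q)
        for u in universe
    )

# Hypothetical example: 2-of-3 majority, quorums {0,1}, {0,2}, {1,2}.
universe = range(3)
majority = [frozenset(q) for q in combinations(universe, 2)]
uniform = [Fraction(1, 3)] * 3                         # every quorum equally likely
skewed = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]  # never uses {1,2}
```

Under the uniform strategy each server is accessed with frequency 2/3, which matches $c(\mathcal{Q})/n$ for this fair system; the skewed strategy puts all accesses on server 0 and induces load 1.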



3.2.2 Availability 

By definition a b-masking quorum system can mask up to $b$ arbitrary (Byzantine) failures.
However, such a system may be resilient to more benign failures. By benign failures we mean
any failures that render a server unresponsive, which we refer to as crashes to distinguish them
from Byzantine failures.

The resilience $f$ of a quorum system provides one measure of how many crash failures a
quorum system is guaranteed to survive, and indeed this measure has been used in the past to
differentiate among quorum systems [BG86]. However, it is possible that an f-resilient quorum
system, though vulnerable to a few failure configurations of $f + 1$ failures, can survive many
configurations of more than $f$ failures. One way to measure this property of a quorum system
is to assume that each server crashes independently with probability $p$ and then to determine
the probability $F_p$ that no quorum survives, i.e., that every quorum contains a crashed member.
This is known as crash probability and is formally defined as follows:

Definition 3.10 Assume that each server in the system crashes independently with probability
$p$. For every quorum $Q \in \mathcal{Q}$ let $E_Q$ be the event that $Q$ is hit, i.e., at least one element
$i \in Q$ has crashed. Let $\mathit{crash}(\mathcal{Q})$ be the event that all the quorums $Q \in \mathcal{Q}$ were hit, i.e.,
$\mathit{crash}(\mathcal{Q}) = \bigwedge_{Q \in \mathcal{Q}} E_Q$. Then the system crash probability is $F_p(\mathcal{Q}) = \mathbb{P}(\mathit{crash}(\mathcal{Q}))$.

We would like $F_p$ to be as small as possible. A desirable asymptotic behavior of $F_p$ is that
$F_p \to 0$ when $n \to \infty$ for all $p < 1/2$, and such an $F_p$ is called Condorcet (after the Condorcet
Jury Theorem [Con]).
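
For tiny systems, Definition 3.10 can be evaluated directly by summing over all crash configurations. The sketch below is an illustrative brute-force computation (exponential in $n$); the function name and the majority example are assumptions, not constructions from the paper.

```python
from itertools import combinations, product

def crash_probability(universe, quorums, p):
    """F_p(Q): probability that every quorum contains a crashed server."""
    servers = sorted(universe)
    total = 0.0
    for flags in product([False, True], repeat=len(servers)):
        crashed = {u for u, is_down in zip(servers, flags) if is_down}
        if all(q & crashed for q in quorums):  # every quorum is hit
            k = len(crashed)
            total += p ** k * (1 - p) ** (len(servers) - k)
    return total

# Hypothetical example: 2-of-3 majority, which fails once two servers crash.
majority = [frozenset(q) for q in combinations(range(3), 2)]
```

For the majority example the result agrees with the closed form $3p^2(1-p) + p^3$.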



4 Building blocks 

In this section, we prove several theorems which will be our basic tools in the sequel. First we
prove lower bounds on the load and availability of b-masking quorum systems, against which
we measure all our new constructions. Then we prove the properties of a quorum composition
technique, which we later use extensively.

4.1 The load and availability of masking quorum systems

We begin by establishing a lower bound on the load of b-masking quorum systems, thus tightening
the lower bound on general quorum systems [NW98] as presented in [MR98a].

Theorem 4.1 Let $\mathcal{Q}$ be a b-masking quorum system. Then $\mathcal{L}(\mathcal{Q}) \ge \max\left\{\frac{2b+1}{c(\mathcal{Q})}, \frac{c(\mathcal{Q})}{n}\right\}$.

Proof: Let $w$ be any strategy for the quorum system $\mathcal{Q}$, and fix $Q_1 \in \mathcal{Q}$ such that $|Q_1| = c(\mathcal{Q})$.
Summing the loads induced by $w$ on all the elements of $Q_1$, and using the fact that any two
quorums have at least $2b + 1$ elements in common, we obtain:

$$\sum_{u \in Q_1} l_w(u) = \sum_{u \in Q_1} \sum_{Q_i \ni u} w(Q_i) = \sum_{Q_i} \sum_{u \in Q_1 \cap Q_i} w(Q_i) \ge \sum_{Q_i} (2b+1)\, w(Q_i) = 2b + 1.$$

Therefore, there exists some element in $Q_1$ that suffers a load of at least $\frac{2b+1}{c(\mathcal{Q})}$.

Similarly, summing the total load induced by $w$ on all of the elements of the universe, and
using the minimality of $c(\mathcal{Q})$, we get:

$$\sum_{u \in U} l_w(u) = \sum_{u \in U} \sum_{Q_i \ni u} w(Q_i) = \sum_{Q_i} |Q_i|\, w(Q_i) \ge \sum_{Q_i} c(\mathcal{Q})\, w(Q_i) = c(\mathcal{Q}).$$

Therefore, there exists some element in $U$ that suffers a load of at least $\frac{c(\mathcal{Q})}{n}$. □

Corollary 4.2 Let $\mathcal{Q}$ be a b-masking quorum system. Then $\mathcal{L}(\mathcal{Q}) \ge \sqrt{\frac{2b+1}{n}}$, and equality holds
if $c(\mathcal{Q}) = \sqrt{(2b+1)n}$. □



Remark: Corollary 4.2 shows that the threshold construction of [MR98a] in fact has optimal
load when $b = \Omega(n)$. E.g., when $b \approx n/4$ the obtained load is $\approx 0.75$, but for such systems
we can only hope for a constant load of $\approx 1/\sqrt{2} = 0.707$. However the load of the threshold
construction is always $\ge 1/2$, which is far from optimal for smaller values of $b$.

On the other hand, the grid-based construction of [MR98a] does not have optimal load. It
has quorums of size $O(b\sqrt{n})$ and load of roughly $2b/\sqrt{n}$. In the sequel we show systems which
significantly improve this: some of our new constructions have quorums of size $O(\sqrt{bn})$ and
optimal load.
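
The bounds of Theorem 4.1 and Corollary 4.2 are easy to evaluate numerically. The sketch below does so for a threshold system with $n = 4b+1$ servers and quorums of size $3b+1$; the function names and the choice $b = 25$ are assumptions made only for illustration.

```python
from math import sqrt

def load_lower_bound(n, b, c):
    """Theorem 4.1: L(Q) >= max((2b+1)/c(Q), c(Q)/n)."""
    return max((2 * b + 1) / c, c / n)

def universal_lower_bound(n, b):
    """Corollary 4.2: L(Q) >= sqrt((2b+1)/n), whatever the quorum size."""
    return sqrt((2 * b + 1) / n)

b = 25
n, c = 4 * b + 1, 3 * b + 1                     # threshold system: 76-of-101
threshold_bound = load_lower_bound(n, b, c)      # equals c/n, about 0.75
best_possible = universal_lower_bound(n, b)      # about 0.71
```

The two terms of Theorem 4.1 multiply to $(2b+1)/n$, so their maximum can never fall below the square root in Corollary 4.2, whichever quorum size $c$ is used.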

Our next propositions show lower bounds on the crash probability $F_p$ in terms of $\mathcal{MT}(\mathcal{Q})$
and $b$.

Proposition 4.3 Let $\mathcal{Q}$ be a quorum system. Then $F_p(\mathcal{Q}) \ge p^{\mathcal{MT}(\mathcal{Q})} = p^{f+1}$ for any $p \in [0, 1]$.

Proof: Consider a minimal transversal $T$ with $|T| = \mathcal{MT}(\mathcal{Q})$. If all the elements of $T$ crash then
every quorum contains a crashed element, so $F_p(\mathcal{Q}) \ge p^{\mathcal{MT}(\mathcal{Q})}$. □

Proposition 4.4 Let $\mathcal{Q}$ be a b-masking quorum system. Then $F_p(\mathcal{Q}) \ge p^{c(\mathcal{Q}) - 2b}$ for any
$p \in [0, 1]$.

Proof: Let $Q \in \mathcal{Q}$ be a minimal quorum with $|Q| = c(\mathcal{Q})$, and consider $Z \subset Q$, $|Z| = 2b$. Since
$\mathcal{Q}$ is b-masking then $|R \cap Q| \ge 2b + 1$ for any $R \in \mathcal{Q}$, and so $|(Q \setminus Z) \cap R| \ge 1$ and $Q \setminus Z$ is a
transversal. Therefore $\mathcal{MT}(\mathcal{Q}) \le c(\mathcal{Q}) - 2b$, which we plug into Proposition 4.3. □

The next proposition is less general than Proposition 4.4, however it is applicable for most of
our constructions and it gives a much tighter bound.

Proposition 4.5 Let $\mathcal{Q}$ be a b-masking quorum system such that $\mathcal{MT}(\mathcal{Q}) \le (\mathcal{IS}(\mathcal{Q}) + 1)/2$.
Then $F_p(\mathcal{Q}) \ge p^{b+1}$ for any $p \in [0, 1]$.

Proof: If $\mathcal{MT}(\mathcal{Q}) \le (\mathcal{IS}(\mathcal{Q}) + 1)/2$ then from Corollary 3.7 we have that $b + 1 = \mathcal{MT}(\mathcal{Q})$, which
again we plug into Proposition 4.3. □



To avoid repetitive notation, we omit floor and ceiling brackets from expressions for integral quantities.

4.2 Quorum system composition

Quorum system composition is a well known technique for building new systems out of existing
components. We compose a quorum system $\mathcal{S}$ over another system $\mathcal{R}$ by replacing each element
of $\mathcal{S}$ with a distinct copy of $\mathcal{R}$. In other words, when element $i$ is used in a quorum $S \in \mathcal{S}$ we
replace it with a complete quorum from the $i$'th copy of $\mathcal{R}$. Using the terminology of reliability
theory, the system $\mathcal{S} \circ \mathcal{R}$ has a modular decomposition where each module is a copy of $\mathcal{R}$.
Formally:

Definition 4.6 Let $\mathcal{S}$ and $\mathcal{R}$ be two quorum systems, over universes of sizes $n_S$ and $n_R$,
respectively. Let $\mathcal{R}_1, \ldots, \mathcal{R}_{n_S}$ be $n_S$ copies of $\mathcal{R}$ over disjoint universes. Then the composition of
$\mathcal{S}$ over $\mathcal{R}$ is

$$\mathcal{S} \circ \mathcal{R} = \left\{ \bigcup_{i \in S} R_i \; : \; S \in \mathcal{S}, \; R_i \in \mathcal{R}_i \text{ for all } i \in S \right\}.$$

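The next theorem summarizes the properties of quorum composition.

Theorem 4.7 Let $\mathcal{S}$ and $\mathcal{R}$ be two quorum systems, and let $\mathcal{Q} = \mathcal{S} \circ \mathcal{R}$. Then

• The universe size is $n_Q = n_S n_R$.

• The minimal quorum size is $c(\mathcal{Q}) = c(\mathcal{S})\, c(\mathcal{R})$.

• The minimal intersection size is $\mathcal{IS}(\mathcal{Q}) = \mathcal{IS}(\mathcal{S})\, \mathcal{IS}(\mathcal{R})$.

• The minimal transversal size is $\mathcal{MT}(\mathcal{Q}) = \mathcal{MT}(\mathcal{S})\, \mathcal{MT}(\mathcal{R})$.

• Denote the crash probability functions of $\mathcal{S}$ and $\mathcal{R}$ by $s(p) = F_p(\mathcal{S})$ and $r(p) = F_p(\mathcal{R})$.
  Then $F_p(\mathcal{Q}) = s(r(p))$.

• The load is $\mathcal{L}(\mathcal{Q}) = \mathcal{L}(\mathcal{S})\, \mathcal{L}(\mathcal{R})$.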

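Proof: The behavior of the combinatorial parameters $n_Q$, $c(\mathcal{Q})$, $\mathcal{IS}(\mathcal{Q})$ and $\mathcal{MT}(\mathcal{Q})$ is obvious.
The behavior of $F_p(\mathcal{Q})$ is standard in reliability theory (cf. [BP75]). As for the load, consider
the following strategy: pick a quorum $S \in \mathcal{S}$ using the optimal strategy for $\mathcal{S}$. Then for each
element $i \in S$, pick a quorum $R_i \in \mathcal{R}_i$ using the optimal strategy for (the $i$'th copy of) $\mathcal{R}$.
Clearly this strategy induces a load of $\mathcal{L}(\mathcal{S})\mathcal{L}(\mathcal{R})$, and hence $\mathcal{L}(\mathcal{Q}) \le \mathcal{L}(\mathcal{S})\mathcal{L}(\mathcal{R})$.

We now show the inequality in the opposite direction. Enumerate the elements of $\mathcal{Q}$ by
denoting the $j$'th element in $\mathcal{R}_i$ by $u_{ij}$, let $\mathcal{Q}(S) = \{\bigcup_{i \in S} R_i : R_i \in \mathcal{R}_i \text{ for all } i \in S\}$ be the set of
all quorums that are based on some $S \in \mathcal{S}$, and let $w^{\mathcal{Q}}$ be an access strategy on $\mathcal{Q}$. Then $w^{\mathcal{Q}}$
induces a strategy $w^{\mathcal{S}}$ on $\mathcal{S}$ defined by

$$w^{\mathcal{S}}(S) = \sum_{Q \in \mathcal{Q}(S)} w^{\mathcal{Q}}(Q). \qquad (2)$$

The load on an element $i \in S$ (i.e., the frequency of accessing the quorum system $\mathcal{R}_i$) is then
$l_{w^{\mathcal{S}}}(i) = \sum_{S \ni i,\, S \in \mathcal{S}} w^{\mathcal{S}}(S)$. Similarly, $w^{\mathcal{Q}}$ induces a strategy $w^{\mathcal{R}_i}$ on each copy $\mathcal{R}_i$ defined by

$$w^{\mathcal{R}_i}(R) = \left( \sum_{Q \supseteq R} w^{\mathcal{Q}}(Q) \right) \Big/ \; l_{w^{\mathcal{S}}}(i). \qquad (3)$$

This $w^{\mathcal{R}_i}$ is well defined when $l_{w^{\mathcal{S}}}(i) > 0$. It is easy to verify that $w^{\mathcal{S}}$ and $w^{\mathcal{R}_i}$ are indeed
strategies, i.e., that the probabilities add up to 1.

Claim 4.8 Let $l_{w^{\mathcal{Q}}}(u_{ij})$ be the load induced by $w^{\mathcal{Q}}$ on an element $u_{ij} \in \mathcal{R}_i$, and let $l_{w^{\mathcal{R}_i}}(u_{ij})$ be
the load induced on it by $w^{\mathcal{R}_i}$. Then $l_{w^{\mathcal{Q}}}(u_{ij}) = l_{w^{\mathcal{S}}}(i) \cdot l_{w^{\mathcal{R}_i}}(u_{ij})$.

Proof of Claim: Using (2) and (3) we have that

$$l_{w^{\mathcal{S}}}(i) \cdot l_{w^{\mathcal{R}_i}}(u_{ij}) = l_{w^{\mathcal{S}}}(i) \sum_{R \ni u_{ij}} \left( \sum_{Q \supseteq R} w^{\mathcal{Q}}(Q) \right) \Big/ \; l_{w^{\mathcal{S}}}(i)
= \sum_{R \ni u_{ij}} \sum_{Q \supseteq R} w^{\mathcal{Q}}(Q) = \sum_{Q \ni u_{ij}} w^{\mathcal{Q}}(Q) = l_{w^{\mathcal{Q}}}(u_{ij}). \; □$$

To complete the proof of Theorem 4.7, assume that $w^{\mathcal{Q}}$ is an optimal strategy for $\mathcal{Q}$. Consider
the copy $\mathcal{R}_i$ for which $l_{w^{\mathcal{S}}}(i)$ is maximal, i.e., $\mathcal{L}_{w^{\mathcal{S}}}(\mathcal{S}) = l_{w^{\mathcal{S}}}(i)$, and let $u_{ij}$ be the maximally-
loaded element in this $\mathcal{R}_i$. Clearly $l_{w^{\mathcal{S}}}(i) > 0$ so $w^{\mathcal{R}_i}$ is well defined for this $i$. Note that we
do not require $u_{ij}$ to be the maximally-loaded element in all of $\mathcal{Q}$. Using the claim and the
minimality of $\mathcal{L}(\mathcal{S})$ and $\mathcal{L}(\mathcal{R})$ we obtain that

$$\mathcal{L}(\mathcal{Q}) = \mathcal{L}_{w^{\mathcal{Q}}}(\mathcal{Q}) \ge l_{w^{\mathcal{Q}}}(u_{ij}) = l_{w^{\mathcal{S}}}(i) \cdot l_{w^{\mathcal{R}_i}}(u_{ij})
= \mathcal{L}_{w^{\mathcal{S}}}(\mathcal{S}) \cdot \mathcal{L}_{w^{\mathcal{R}_i}}(\mathcal{R}_i) \ge \mathcal{L}(\mathcal{S})\, \mathcal{L}(\mathcal{R}).$$

By combining this inequality with the upper bound we had before we conclude that $\mathcal{L}(\mathcal{Q}) =
\mathcal{L}(\mathcal{S})\mathcal{L}(\mathcal{R})$. □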
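
The multiplicative behavior of the combinatorial parameters can be checked by brute force on a small composition. The sketch below composes a 2-of-3 majority system over 3-of-4 threshold systems; all function names and the choice of component systems are assumptions made for illustration, and the exhaustive searches are only practical for very small universes.

```python
from itertools import combinations, product

def threshold(k, l, tag):
    """All l-subsets of k elements; `tag` keeps the universes of copies disjoint."""
    return [frozenset((tag, u) for u in q) for q in combinations(range(k), l)]

def compose(n_s, s_quorums, make_copy):
    """S o R: replace each element i of a quorum S by a full quorum of copy R_i."""
    copies = [make_copy(i) for i in range(n_s)]
    composed = []
    for s in s_quorums:
        for choice in product(*(copies[i] for i in sorted(s))):
            composed.append(frozenset().union(*choice))
    return composed

def c(quorums):
    return min(len(q) for q in quorums)

def intersection(quorums):
    return min(len(q & r) for q, r in product(quorums, repeat=2))

def transversal(quorums):
    universe = sorted(set().union(*quorums))
    for size in range(1, len(universe) + 1):
        for t in combinations(universe, size):
            if all(q & set(t) for q in quorums):
                return size

# Hypothetical example: S = 2-of-3 majority, R = 3-of-4 threshold.
S = [frozenset(q) for q in combinations(range(3), 2)]
R0 = threshold(4, 3, 0)
Q = compose(3, S, lambda i: threshold(4, 3, i))
```

In this example the composed system has 12 elements, quorums of size 6, intersections of size 2 and minimal transversals of size 4, each being the product of the corresponding parameters of the two components.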

The multiplicative behavior of the combinatorial parameters in composing quorum systems
provides a powerful tool for "boosting" existing constructions into larger systems with possibly
improved characteristics. Below, we use quorum composition in two cases, and demonstrate that
this technique yields improved constructions over their basic building blocks, for appropriately
larger system sizes. In particular, in Section 6 we show a composition that allows us to transform
any regular quorum construction into a (larger) b-masking quorum system.



5 Simple systems 

In this section we show two types of constructions, the multi-grid (denoted M-Grid) and the
recursive threshold (RT). These systems significantly improve upon the original constructions
of [MR98a], however both are still suboptimal in some parameter: M-Grid has optimal load but
can mask only up to $b = O(\sqrt{n})$ failures and has poor crash probability; and RT can mask up
to $b = O(n)$ failures and has near optimal crash probability, but has suboptimal load.

In Sections 6 and 7 we present systems which are superior to the M-Grid and RT. Nonetheless,
we feel that the simplicity of the M-Grid and RT systems, and the fact that they are suitable
for very small universe sizes, are what makes them appealing.

Figure 1: The multi-grid construction, $n = 7 \times 7$, $b = 3$, with one quorum shaded.



5.1 The multi-grid system 

We begin with the M-Grid system, which achieves an optimal load among b-masking quorum
systems, where $b \le (\sqrt{n} - 1)/2$. The idea of the construction is as follows. Arrange the elements
in a $\sqrt{n} \times \sqrt{n}$ grid. A quorum in a multi-grid consists of any choice of $\sqrt{b+1}$ rows and $\sqrt{b+1}$
columns, as shown in Figure 1. Formally, denote the rows and columns of the grid by $R_i$ and
$C_i$, respectively, where $1 \le i \le \sqrt{n}$. Then, the quorum system is

$$\text{M-Grid}(b) = \left\{ \bigcup_{j \in J} C_j \cup \bigcup_{i \in I} R_i \; : \; J, I \subseteq \{1 \ldots \sqrt{n}\}, \; |J| = |I| = \sqrt{b+1} \right\}.$$

Proposition 5.1 The multi-grid M-Grid($b$) is a b-masking quorum system for $b \le (\sqrt{n} - 1)/2$.

Proof: Consider two quorums $R, S \in$ M-Grid($b$). If they have either a row or a column in
common, then $|R \cap S| \ge \sqrt{n} \ge 2b + 1$ and we are done. Otherwise the intersection of $S$'s
columns with $R$'s rows is disjoint from the intersection of $R$'s columns with $S$'s rows, so $|R \cap S| \ge
2\sqrt{b+1}\sqrt{b+1} > 2b + 1$. Therefore consistency holds.

Resilience holds since $f = \mathcal{MT}(\text{M-Grid}(b)) - 1 = \sqrt{n} - \sqrt{b+1} \ge b$. Therefore $\mathcal{MT}(\text{M-Grid}(b)) \ge
b + 1$, and Lemma 3.6 finishes the proof. □
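
The consistency argument can be checked exhaustively for the parameters of Figure 1 ($n = 7 \times 7$, $b = 3$). The following sketch is an illustration only: the function name is an assumption, and it assumes that $b + 1$ is a perfect square so that $\sqrt{b+1}$ is an integer.

```python
from itertools import combinations
from math import isqrt

def m_grid(side, b):
    """All M-Grid(b) quorums on a side x side grid of (row, column) cells."""
    k = isqrt(b + 1)  # number of rows and of columns; assumes b + 1 is a square
    quorums = []
    for rows in combinations(range(side), k):
        for cols in combinations(range(side), k):
            cells = {(r, col) for r in rows for col in range(side)}
            cells |= {(r, col) for r in range(side) for col in cols}
            quorums.append(frozenset(cells))
    return quorums

side, b = 7, 3
quorums = m_grid(side, b)
min_overlap = min(len(q & r) for q, r in combinations(quorums, 2))
```

Every quorum here consists of 2 rows and 2 columns, i.e., $2 \cdot 7 + 2 \cdot 7 - 4 = 24$ cells, and any two distinct quorums share at least $2b + 1 = 7$ cells.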



Proposition 5.2 $\mathcal{L}(\text{M-Grid}(b)) \approx 2\sqrt{b+1}/\sqrt{n}$.

Proof: Since M-Grid($b$) is fair we can use Proposition 3.9 to get $\mathcal{L}(\text{M-Grid}(b)) = c(\text{M-Grid}(b))/n$,
where $c(\text{M-Grid}(b)) = 2\sqrt{b+1}\sqrt{n} - (b+1)$. □

Remark: The load of M-Grid($b$) is within a factor of $\sqrt{2}$ from the optimal load which can be
achieved for $b \approx \sqrt{n}/2$.

Figure 2: An RT(4, 3) system of depth $h = 2$, with one quorum shaded.

A disadvantage of the M-Grid system is its poor asymptotic crash probability. If crashes
occur with some constant probability $p$ then any configuration of crashes with at least one crash
per row disables the system. Therefore, as shown by [KC91, Woo96],

$$F_p(\text{M-Grid}) \ge \left(1 - (1-p)^{\sqrt{n}}\right)^{\sqrt{n}} \xrightarrow[n \to \infty]{} 1.$$



5.2 Recursive threshold systems 

A recursive threshold system RT($k, \ell$) of depth $h$ is built by taking a simple building block,
which is an $\ell$-of-$k$ threshold system (with $k > \ell > k/2$), and recursively composing it over itself
to depth $h$. In the sequel, we often omit the depth parameter $h$ when it has no effect on the
discussion. The RT systems generalize the recursive majority constructions of [MP92]; the HQS
system of [Kum91] is an RT(3, 2) system, and in fact the threshold system of [MR98a] can be
viewed as a trivial RT($4b + 1, 3b + 1$) system with depth $h = 1$. As an example throughout this
section we will use the RT(4, 3) system, depicted in Figure 2.

Proposition 5.3 An RT($k, \ell$) system of depth $h$ is a fair quorum system, with $n = k^h$ elements,
quorums of size $c(\text{RT}(k,\ell)) = \ell^h$, intersection size of $\mathcal{IS}(\text{RT}(k,\ell)) = (2\ell - k)^h$, and minimal
transversals of size $\mathcal{MT}(\text{RT}(k,\ell)) = (k - \ell + 1)^h$.

Proof: The basic $\ell$-of-$k$ system is symmetric (and therefore fair), with $c(\ell\text{-of-}k) = \ell$, $\mathcal{MT}(\ell\text{-of-}k) =
k - \ell + 1$, and $\mathcal{IS}(\ell\text{-of-}k) = 2\ell - k$. The combinatorial parameters are computed by applying
Theorem 4.7 $h$ times, and the composition preserves the fairness. □



Plugging this into Corollary 3.7 we obtain

Corollary 5.4 An RT($k, \ell$) system over a universe of size $n$ is a b-masking quorum system for
$b = \min\{(n^{\log_k(2\ell - k)} - 1)/2, \; n^{\log_k(k - \ell + 1)} - 1\}$. □

In the 3-of-4 example we have $\mathcal{IS}(\text{3-of-4}) = \mathcal{MT}(\text{3-of-4}) = 2$ and $c(\text{3-of-4}) = 3$. Therefore for
the whole system (to depth $\log_4 n$) we get $c(\text{RT}(4,3)) = n^{\log_4 3} = n^{0.79}$, and $\mathcal{IS}(\text{RT}(4,3)) =
\mathcal{MT}(\text{RT}(4,3)) = \sqrt{n}$ and thus $b = (\sqrt{n} - 1)/2$. Note that the basic 3-of-4 system is not even
1-masking since intersections of size 2 are too small, however already from $h = 2$ (i.e., $n = 16$)
we obtain a masking system.
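
The parameters of Proposition 5.3 and the masking level of Corollary 5.4 can be tabulated by depth. The sketch below is a small illustration; the function name and the returned dictionary layout are assumptions, and the masking level is rounded down to an integer.

```python
def rt_parameters(k, l, h):
    """Combinatorial parameters of an RT(k, l) system of depth h."""
    n = k ** h                        # universe size
    c = l ** h                        # quorum size
    smallest_is = (2 * l - k) ** h    # minimal intersection size
    smallest_mt = (k - l + 1) ** h    # minimal transversal size
    b = min(smallest_mt - 1, (smallest_is - 1) // 2)  # Corollary 3.7, rounded down
    return {"n": n, "c": c, "IS": smallest_is, "MT": smallest_mt, "b": b}

# RT(4, 3) at depths 1, 2 and 3.
depth_1 = rt_parameters(4, 3, 1)
depth_2 = rt_parameters(4, 3, 2)
depth_3 = rt_parameters(4, 3, 3)
```

For RT(4, 3) this gives $b = 0$ at depth 1, $b = 1$ at depth 2 ($n = 16$) and $b = 3$ at depth 3 ($n = 64$).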



Proposition 5.5 The load $\mathcal{L}(\text{RT}(k,\ell)) = n^{-(1 - \log_k \ell)}$.

Proof: Since RT($k, \ell$) is fair we can use Proposition 3.9 to get $\mathcal{L}(\text{RT}(k,\ell)) = c(\text{RT}(k,\ell))/n$. □

Remark: In general the load is suboptimal for this construction. For instance, in the RT(4, 3)
system we obtain $\mathcal{L}(\text{RT}(4,3)) = n^{-0.21}$. However for $b = (\sqrt{n} - 1)/2$ we could hope for a load
of $\sqrt{(2b+1)/n} = n^{-0.25}$.

Proposition 5.6 There exists a unique critical probability $0 < p_c < 1/2$ for which

$$\lim_{h \to \infty} F_p(\text{RT}(k,\ell) \text{ of depth } h) = \begin{cases} 0, & p < p_c, \\ 1, & p > p_c. \end{cases}$$

Proof: Let $g(p)$ be the crash probability function of the $\ell$-of-$k$ system and let $F(h) = F_p(\text{RT}(k,\ell) \text{ of depth } h)$
denote the crash probability for the RT($k, \ell$) system of depth $h$. Then $F(h)$ obeys the recurrence

$$F(h) = \begin{cases} g(F(h-1)), & h > 0, \\ p, & h = 0. \end{cases} \qquad (4)$$

Now $g(p)$ is a reliability function, and therefore it is "S-shaped" (see [BP75]). This implies that
there exists a unique critical probability $0 < p_c < 1$ for which $g(p_c) = p_c$, such that $g(p) < p$
when $p < p_c$ and $g(p) > p$ when $p > p_c$ (and [PW95] shows that for quorum systems such as
RT in fact $p_c < 1/2$). Therefore if $p < p_c$ then repeated applications of recurrence (4) would
decrease $F(h)$ arbitrarily close to 0, and when $p > p_c$ the limit is 1. □

Proposition 5.7 If $p < 1/\binom{k}{k-\ell+1}$ and $\ell < k$ then $F_p(\text{RT}(k,\ell)) \le \exp(-\Omega(n^{\log_k(k-\ell+1)}))$, which
is optimal for systems with resilience $f = n^{\log_k(k-\ell+1)}$.

Proof: Let $g(p)$ and $F(h)$ be as in the proof of Proposition 5.6. Any configuration of at least
$k - \ell + 1$ crashes disables the $\ell$-of-$k$ system, so

$$g(p) = \sum_{j=k-\ell+1}^{k} \binom{k}{j} p^j (1-p)^{k-j}.$$

By Lemma A.2 (see Appendix) we have that

$$g(p) \le \binom{k}{k-\ell+1} p^{k-\ell+1}.$$

Plugging this into (4) gives that

$$F(h) \le \binom{k}{k-\ell+1}^{\frac{(k-\ell+1)^h - 1}{k-\ell}} p^{(k-\ell+1)^h} \le \left[ \binom{k}{k-\ell+1}\, p \right]^{(k-\ell+1)^h}.$$

If $p < 1/\binom{k}{k-\ell+1}$ then the last expression decays to zero with $h$, so $F_p(\text{RT}(k,\ell)) \le \exp(-\Omega(n^{\log_k(k-\ell+1)}))$.
The lower bound of Proposition 4.3 shows that

$$F_p(\text{RT}(k,\ell)) \ge p^{\,n^{\log_k(k-\ell+1)}},$$

so our analysis is tight. □



For the RT(4, 3) system a direct calculation shows that $g(p) = 6p^2 - 8p^3 + 3p^4$ and $p_c = 0.2324$.
Therefore Proposition 5.6 guarantees that when the element crash probability is in the range
$p < 0.2324$ then $F_p \to 0$ when $n \to \infty$. Furthermore, when $p < 1/6$ then Proposition 5.7 shows
that the decay is rapid, with $F_p(\text{RT}(4,3)) \le (6p)^{\sqrt{n}}$, which is optimal.
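
The RT(4, 3) figures can be reproduced numerically by iterating the recurrence $F(h) = g(F(h-1))$ with $F(0) = p$. The sketch below is an illustration under that recurrence; the function names are assumptions, and the closed form used for $p_c$ follows from factoring $g(p) - p = p(p-1)(3p^2 - 5p + 1)$.

```python
def g(p):
    """Crash probability of one 3-of-4 block: at least 2 of its 4 elements crash."""
    return 6 * p**2 - 8 * p**3 + 3 * p**4

def rt43_crash_probability(p, h):
    """F(h) for RT(4, 3): apply g to p a total of h times."""
    f = p
    for _ in range(h):
        f = g(f)
    return f

# Fixed point of g in (0, 1): the smaller root of 3p^2 - 5p + 1 = 0.
p_c = (5 - 13 ** 0.5) / 6
```

Below the critical probability the iterates fall quickly towards 0 (for example, starting from $p = 0.2$), while above it they climb towards 1 (for example, starting from $p = 0.3$).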



6 Boosted finite projective planes 

In this section we introduce a family of b-masking quorum systems, the boosted finite projective
planes, which we denote by boostFPP. A boostFPP system is a composition of a finite projective
plane (FPP) over a threshold system (Thresh).

The first component of a boostFPP system is a finite projective plane of order $q$ (a good
reference on FPP's is [Hal86]). It is known that FPP's exist for $q = p^r$ when $p$ is prime. Such an
FPP has $n_F = q^2 + q + 1$ elements, and quorums of size $c(\text{FPP}) = q + 1$. This is a regular quorum
system, i.e., it has intersections of size $\mathcal{IS}(\text{FPP}) = 1$. The minimal transversals of an FPP are
of size $\mathcal{MT}(\text{FPP}) = q + 1$ (in fact the only transversals of this size are the quorums themselves).
The load of FPP was analyzed in [NW98] and shown to be $\mathcal{L}(\text{FPP}) = \frac{q+1}{q^2+q+1} \approx 1/\sqrt{n_F}$, which is
optimal for regular quorum systems.

The second component of a boostFPP is a Thresh system, with $n_T = 4b + 1$ elements and
a threshold of $3b + 1$. This is a b-masking quorum system in itself, with $\mathcal{IS}(\text{Thresh}) = 2b + 1$,
$\mathcal{MT}(\text{Thresh}) = b + 1$ and a load of $\mathcal{L}(\text{Thresh}) \approx 3/4$.

Proposition 6.1 Let boostFPP($q, b$) = FPP($q$) $\circ$ Thresh($3b + 1$ of $4b + 1$). Then the composed
system has $n = (4b + 1)(q^2 + q + 1)$ elements, with quorums of size $c(\text{boostFPP}(q, b)) = (3b +
1)(q + 1)$, intersections of size $\mathcal{IS}(\text{boostFPP}(q, b)) = 2b + 1$ and minimal transversals of size
$\mathcal{MT}(\text{boostFPP}(q, b)) = (b + 1)(q + 1)$. Therefore boostFPP($q, b$) is a b-masking quorum system.

Proof: We obtain the combinatorial parameters by plugging the values of the component systems
into Theorem 4.7. By Corollary 3.7 we have that the system can mask $\min\{(b+1)(q+1) - 1, \; b\} =
b$ failures. □

Proposition 6.2 $\mathcal{L}(\text{boostFPP}(q, b)) \approx \frac{3}{4q}$, which is optimal for b-masking quorum systems with
$n \approx 4bq^2$ elements.

Proof: boostFPP($q, b$) is a fair quorum system since both its components are fair, so by
Proposition 3.9 we have

$$\mathcal{L}(\text{boostFPP}(q, b)) = \frac{c(\text{boostFPP}(q, b))}{n} = \frac{(3b+1)(q+1)}{(4b+1)(q^2+q+1)} \approx \frac{3}{4q}.$$

On the other hand, for b-masking systems with $n \approx 4bq^2$ elements the lower bound of
Theorem 4.1 gives

$$\mathcal{L}(\text{boostFPP}(q, b)) \ge \sqrt{\frac{2b+1}{n}} \approx \frac{1}{\sqrt{2}\, q}. \; □$$
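
The parameters of Proposition 6.1 and the load comparison of Proposition 6.2 can be evaluated for concrete values. The sketch below is an illustration; the function name, the dictionary layout and the choice $q = 3$, $b = 2$ are assumptions made here.

```python
from fractions import Fraction
from math import sqrt

def boost_fpp_parameters(q, b):
    """Parameters of boostFPP(q, b) = FPP(q) o Thresh(3b+1 of 4b+1)."""
    n = (4 * b + 1) * (q * q + q + 1)
    c = (3 * b + 1) * (q + 1)
    smallest_is = 2 * b + 1
    smallest_mt = (b + 1) * (q + 1)
    return {
        "n": n,
        "c": c,
        "IS": smallest_is,
        "MT": smallest_mt,
        "masks": min(smallest_mt - 1, (smallest_is - 1) // 2),  # Corollary 3.7
        "load": Fraction(c, n),                       # fair system: c / n
        "load_lower_bound": sqrt((2 * b + 1) / n),    # Corollary 4.2
    }

params = boost_fpp_parameters(q=3, b=2)
```

With $q = 3$ and $b = 2$ the system has 117 elements and quorums of size 28, so its load $28/117 \approx 0.24$ lies within a small constant factor of the lower bound $\sqrt{5/117} \approx 0.21$.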

Note that the optimality of the load holds for any choice of $q$ and $b$. Therefore when the
number of servers (or elements) increases, the boostFPP($q, b$) system can scale up using different
policies while maintaining load optimality. There are two extremal policies:

1. Fix $q$ and increase $b$; then the system can mask more failures when new servers are added,
however the load on the servers does not decrease.

2. Fix $b$ and increase $q$; then the load decreases when new servers are added, but the number
of failures that the system can mask remains unchanged.

It is important to note that systems of arbitrarily high resilience can be constructed using
the first policy since $b$ can be chosen independently of $q$. In particular, we can choose $b = q^{\alpha}$
for any $\alpha > 0$. Then the resulting system has $n \approx 4bq^2 = 4b^{\frac{\alpha+2}{\alpha}}$, and so $b \approx (n/4)^{\frac{\alpha}{\alpha+2}}$, thus
asymptotically approaching the resilience upper bound of $n/4$.

Finally we analyze the crash probability of boostFPP. The following proposition shows that
boostFPP has good availability as long as $p < 1/4$.

Proposition 6.3 If $p < 1/4$ then $F_p(\text{boostFPP}(q, b)) \le \exp(-\Omega(b - \log q))$.

Proof: We start by estimating $F_p(\text{Thresh})$. Let #crashed denote the number of crashed elements
in a universe of size $4b + 1$. Let $\gamma = \frac{b+1}{4b+1} - p$, thus $0 < \gamma < 1$ when $p < 1/4$. Then using the
Chernoff bound we obtain

$$F_p(\text{Thresh}) = \mathbb{P}(\#\text{crashed} \ge b + 1) = \mathbb{P}(\#\text{crashed} \ge (p + \gamma)(4b + 1))
\le e^{-2(4b+1)\gamma^2} \le e^{-b(1-4p)^2/2}. \qquad (5)$$

Next we estimate $F_p(\text{FPP})$. Let $Q_0 \in$ FPP be some quorum. Then

$$F_p(\text{FPP}) = 1 - \mathbb{P}(\exists Q \in \text{FPP} : Q \text{ is alive}) \le
1 - \mathbb{P}(Q_0 \text{ is alive}) = 1 - (1 - p)^{q+1} \le (q + 1)p. \qquad (6)$$

Using Theorem 4.7 we plug (5) into (6) to obtain

$$F_p(\text{boostFPP}(q, b)) \le (q + 1)\, e^{-b(1-4p)^2/2} = e^{-\Omega(b - \log q)}. \; □$$

Remarks:

• In general the crash probability is not optimal; since $\mathcal{MT}(\text{boostFPP}(q, b)) \approx bq$ then the
lower bound of Proposition 4.3 shows we could hope for a crash probability of $\exp(-\Omega(bq))$.
Nevertheless if $q$ is constant then $F_p$ is asymptotically optimal, and if $b \gg q$ then the gap
between the upper and lower bounds is small.

• The final estimate we get for $F_p(\text{boostFPP})$ seems poor, as the bound is higher than
the crash probability of the Thresh components. However this is not an artifact of
overestimates in our analysis. Rather, it is a result of the property that the crash probability
of FPP is higher than $p$, and in fact $F_p(\text{FPP}) \to 1$ as shown by [RST92, Woo96]. In this
light it is not surprising that boostFPP does not have an optimal crash probability.

• The requirement $p < 1/4$ is essential for this system; if $p > 1/4$ then in fact $F_p(\text{boostFPP})
\to 1$ as $n \to \infty$.



7 The multi-path system 

Here we introduce the construction we call the Multi-Path system, denoted by M-Path. The
elements of this system are the vertices of a triangulated square $\sqrt{n} \times \sqrt{n}$ grid;
formally, the vertices are the points $\{(i,j) : 1 \le i,j \le \sqrt{n};\ i,j \in \mathbb{Z}\}$.
The triangulated grid has an edge between $(i_1,j_1)$ and $(i_2,j_2)$ if one of the following three
conditions holds: (i) $i_1 = i_2$ and $j_2 = j_1 + 1$; (ii) $j_1 = j_2$ and $i_2 = i_1 + 1$;
(iii) $i_2 = i_1 - 1$ and $j_2 = j_1 + 1$. A quorum in the M-Path system consists of $\sqrt{2b+1}$
disjoint paths from the left side to the right side of the grid (LR paths) and $\sqrt{2b+1}$
disjoint top-bottom (TB) paths (see Figure 3).
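
The adjacency rule can be made concrete with a short sketch that enumerates the edges of the triangulated grid from conditions (i) to (iii). The names are ours, and m stands for $\sqrt{n}$, assumed to be an integer.

```python
def triangulated_grid_edges(m):
    """Edges of the triangulated m x m grid on vertices (i, j) with 1 <= i, j <= m."""
    def inside(i, j):
        return 1 <= i <= m and 1 <= j <= m

    edges = set()
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            # (i) same i, next j; (ii) same j, next i; (iii) diagonal (i - 1, j + 1)
            for (i2, j2) in ((i, j + 1), (i + 1, j), (i - 1, j + 1)):
                if inside(i2, j2):
                    edges.add(((i, j), (i2, j2)))
    return edges

def degree(edges, vertex):
    """Number of edges incident to the given vertex."""
    return sum(1 for (u, v) in edges if vertex in (u, v))

edges = triangulated_grid_edges(9)
print(len(edges), degree(edges, (5, 5)))
```

Every interior vertex has six neighbours, which is what distinguishes this lattice from the plain square grid, where interior vertices have four.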

The M-Path system has several characteristics similar to the basic M-Grid system of
Section 5, namely an ability to mask $b = O(\sqrt{n})$ failures, and optimal load. Its major
advantage is that it also has an optimal crash probability $F_p$. Moreover, it is the only
construction we have for which $F_p \to 0$ as $n \to \infty$ when the individual crash probability p
is arbitrarily close to 1/2. We are able to prove this behavior of $F_p$ using results from
Percolation theory [Kes82, Gri89].



Remark: The system we present here is based on a triangular lattice, with elements
corresponding to vertices, as in [WB92, Baz96]. We have also constructed a second system which is
based on the square lattice with elements corresponding to the edges, as in [NW98]. The
properties of this second system are almost identical to those of M-Path, so we omit it.



Proposition 7.1 M-Path(b) has minimal quorums of size $c(\text{M-Path}) \le 2\sqrt{n(2b+1)}$,
minimal intersections of size $\mathcal{IS}(\text{M-Path}) \ge 2b+1$, and minimal transversals of
size $\mathcal{MT}(\text{M-Path}) = \sqrt{n} - \sqrt{2b+1} + 1$. Therefore M-Path is a b-masking
quorum system for $b \le \sqrt{n} - \sqrt{2}\,n^{1/4}$.

Figure 3: A multi-path construction on a 9 x 9 grid, b = 4, with one quorum shaded.



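Proof: Let $Q_1, Q_2 \in$ M-Path(b). Each of the $\sqrt{2b+1}$ LR paths of $Q_1$ crosses each of
the $\sqrt{2b+1}$ TB paths of $Q_2$, and since the LR paths are pairwise disjoint, as are the TB
paths, these crossings give $\ge 2b+1$ distinct elements of $Q_1 \cap Q_2$. As in the M-Grid
system we have that $\mathcal{MT}(\text{M-Path}(b)) = \sqrt{n} - \sqrt{2b+1} + 1$, so when
$b \le \sqrt{n} - \sqrt{2}\,n^{1/4}$ it follows that $\mathcal{MT}(\text{M-Path}(b)) \ge b+1$ and
we are done. □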
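
The quantities in Proposition 7.1 are easy to tabulate for concrete grids. The sketch below is illustrative only: it assumes that n is a perfect square and that the number of paths per direction is $\sqrt{2b+1}$ rounded up to an integer (the rounding is our assumption; the text writes $\sqrt{2b+1}$ without it). The function name and dictionary keys are ours.

```python
import math

def mpath_parameters(n, b):
    """Path count, quorum-size bound, and minimal transversal size for M-Path(b) on n elements."""
    m = math.isqrt(n)
    if m * m != n:
        raise ValueError("n must be a perfect square")
    k = math.isqrt(2 * b + 1)
    if k * k < 2 * b + 1:
        k += 1                      # round sqrt(2b+1) up to an integer (assumption)
    min_transversal = m - k + 1     # sqrt(n) - sqrt(2b+1) + 1 with the rounded path count
    return {
        "paths_per_direction": k,
        "quorum_size_bound": 2 * m * k,            # k LR paths and k TB paths of length sqrt(n)
        "min_transversal": min_transversal,
        "is_b_masking": min_transversal >= b + 1,  # the condition MT >= b + 1 from the proof
    }

print(mpath_parameters(1024, 7))
print(mpath_parameters(81, 4))
```

For n = 1024 and b = 7 this yields 4 paths per direction, as in the example of the Discussion, and for the 9 x 9 grid of Figure 3 with b = 4 it yields 3.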



Proposition 7.2 $\mathcal{L}(\text{M-Path}(b)) \le 2\sqrt{(2b+1)/n}$, which is optimal.



Proof: The strategy only uses straight line LR and TB paths. It picks $\sqrt{2b+1}$ of the
$\sqrt{n}$ rows uniformly at random and likewise for the columns. Clearly the load equals the
probability of accessing some element in position $(i,j)$, which is

$$\mathcal{L}(\text{M-Path}) \le \mathbb{P}(\text{row } i \text{ chosen}) + \mathbb{P}(\text{column } j \text{ chosen})
  = 2\,\binom{\sqrt{n}-1}{\sqrt{2b+1}-1} \Big/ \binom{\sqrt{n}}{\sqrt{2b+1}}
  = \frac{2\sqrt{2b+1}}{\sqrt{n}}.$$

By Corollary 4.2 this is optimal. □
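
The counting step in this proof can be checked with exact arithmetic. In the sketch below, m stands for $\sqrt{n}$ and k for the number of rows (and of columns) picked; the assumption that rows and columns are drawn independently is ours, as are the function names.

```python
from fractions import Fraction
from math import comb

def row_hit_probability(m, k):
    """P(a fixed row is among k rows drawn uniformly from m) = C(m-1, k-1) / C(m, k)."""
    return Fraction(comb(m - 1, k - 1), comb(m, k))

def straight_line_load(m, k):
    """Exact access probability of one element, and the union bound used in the proof."""
    r = row_hit_probability(m, k)
    exact = 1 - (1 - r) ** 2     # element is hit if its row or its column is chosen
    union_bound = 2 * r          # P(row chosen) + P(column chosen)
    return exact, union_bound

exact, bound = straight_line_load(32, 4)
print(exact, bound)  # 15/64 1/4
```

For a 32 x 32 grid with 4 paths per direction the union bound is exactly 1/4, the target load in the example of the Discussion.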



Proposition 7.3 $F_p(\text{M-Path}(b)) \le \exp(-\Omega(\sqrt{n} - \sqrt{b}))$ for any p < 1/2,
which is optimal for systems with resilience $f = O(\sqrt{n} - \sqrt{b})$.

Proof: We use the notation $\mathbb{P}_p(\mathcal{E})$ to denote the probability of event
$\mathcal{E}$ defined on the grid when the individual crash probability is p. A path is called
"open" if all its elements are alive.

Let LR be the event "there exists an open LR path in the grid", and let $LR_k$ be the event
"there exist k disjoint open LR paths". A failure configuration in M-Path(b) is one in which
either $\sqrt{2b+1}$ open LR paths or $\sqrt{2b+1}$ open TB paths do not exist. By symmetry we
have that

$$F_p(\text{M-Path}(b)) \le 2\,\mathbb{P}_p\big(\overline{LR_{\sqrt{2b+1}}}\big)
  = 2\big(1 - \mathbb{P}_p(LR_{\sqrt{2b+1}})\big). \qquad (7)$$
Fix some p' such that p < p' < 1/2. Then by Theorem B.3 (see Appendix) we have that

$$1 - \mathbb{P}_p(LR_{\sqrt{2b+1}}) \le \left(\frac{1-p}{p'-p}\right)^{\sqrt{2b+1}-1}
  \big[1 - \mathbb{P}_{p'}(LR)\big]. \qquad (8)$$

Plugging the bound on $\mathbb{P}_{p'}(LR)$ from Theorem B.1 into (7) and (8) yields

$$F_p(\text{M-Path}(b)) \le 2\left(\frac{1-p}{p'-p}\right)^{\sqrt{2b+1}-1} e^{-\psi(p')\sqrt{n}}
  = 2\,e^{-\psi(p')\sqrt{n} + (\sqrt{2b+1}-1)\ln\frac{1-p}{p'-p}}$$

for some function $\psi(p') > 0$. Now $\sqrt{2b+1} = O(n^{1/4})$, so for large enough n we can
certainly write

$$F_p(\text{M-Path}(b)) \le \exp(-\Omega(\sqrt{n} - \sqrt{b})).$$

This is optimal by Proposition 4.3. □
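
The shape of the bound obtained in this proof can be explored numerically. The sketch below is hedged in two ways. First, the value of $\psi(p')$ is not given explicitly by Theorem B.1, so it is passed in as a hypothetical parameter psi chosen by the caller. Second, the factor $(1-p)/(p'-p)$ is our reading of Theorem B.3 with p interpreted as the probability that a vertex is closed. The function name is ours.

```python
import math

def mpath_crash_bound(n, b, p, p_prime, psi):
    """Evaluate 2 * exp(-psi * sqrt(n) + (sqrt(2b+1) - 1) * ln((1 - p) / (p' - p))).

    psi is a caller-supplied stand-in for the unspecified constant psi(p') > 0.
    """
    if not 0 <= p < p_prime < 0.5:
        raise ValueError("need 0 <= p < p' < 1/2")
    if psi <= 0:
        raise ValueError("psi must be positive")
    k = math.sqrt(2 * b + 1)
    exponent = -psi * math.sqrt(n) + (k - 1) * math.log((1 - p) / (p_prime - p))
    return 2 * math.exp(exponent)

# For fixed b, p, p' and psi, the bound decays exponentially in sqrt(n).
for n in (1024, 4096, 16384):
    print(n, mpath_crash_bound(n, 7, 0.125, 0.25, psi=0.5))
```

Moving p' closer to p inflates the logarithmic term, while moving it closer to 1/2 shrinks $\psi(p')$, so the choice of p' trades one factor against the other.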

8 Discussion 

We have presented four novel constructions of b-masking quorum systems. For the first time in
this context, we considered the resilience of such systems to crash failures in addition to their
tolerance of (possibly fewer) Byzantine failures. Each of our constructions is optimal in either
its load or its crash probability (for sufficiently small p). Moreover, one of our constructions,
namely M-Path, is optimal in both measures. One of our constructions is achieved using a novel
boosting technique that makes all known benign fault-tolerant quorum constructions available
for Byzantine environments (of appropriate sizes). In proving optimality of our constructions,
we also contribute lower bounds on the load and crash probability of any b-masking quorum
system.

The properties of our various constructions are summarized in Table 2, alongside the
properties of two other b-masking constructions proposed in [MR98a], namely Threshold and Grid.



Determining the best quorum construction depends on the goals and constraints of any
particular setting, as no system is advantageous in all measures. For example, suppose we
fix n to be 1024, the desired load $\mathcal{L}$ to be approximately 1/4, and assume that the
individual failure probability of components is p = 1/8. In this setting, an M-Grid system can
tolerate b = 15 Byzantine failures and up to f = 28 benign failures, but has a failure
probability $F_p \ge 0.638$. In the same setting, a boostFPP system (with n = 1001, q = 3) can
tolerate b = 19 Byzantine failures and up to f = 79 benign failures, with somewhat better
failure probability: it has $F_p \le 0.372$. The M-Path construction, with 4 LR and 4 TB paths
per quorum, has b = 7 here, and can tolerate up to f = 29 benign failures, but has a good crash
probability: $F_p \le 0.001$ (using the estimate following Theorem B.1, together with
Theorem B.3 with p' = 1/7). In this setting, the RT(4, 3) construction, with depth h = 5, is the
best, with b = 15, f = 31 and an excellent failure probability of only $F_p \le 0.0001$.
System              b <                      f              L                    F_p

Threshold [MR98a]   n/4                      O(n - b)       1/2 + O(b/n)         exp(-Ω(f)) *
Grid [MR98a]        [...]                    O(√n - b)      O(b/√n)              → 1
M-Grid              [...]                    O(√n - √b)     [...]                → 1
RT(k, l)            O(min{n^α1, n^α2}) ‡     O(b)           n^(-(1 - log_k l))   exp(-Ω(f)) *
boostFPP            n/4                      O(√(bn))       O(√(b/n)) †          exp(-Ω(b - log(n/b)))
M-Path              (1 - o(1))√n             O(√n - √b)     O(√(b/n)) †          exp(-Ω(f)) *

† Optimal for b-masking systems
* Optimal for f-resilient systems
‡ α1 = log_k(2l - k) and α2 = log_k(k - l + 1)

Table 2: Constructions in this paper (n = number of servers).



More generally, if masking large numbers of Byzantine server failures is important, then of
the systems listed in Table 2, only Threshold and boostFPP can provide the highest possible
masking ability, i.e., up to b < n/4. However, Threshold can mask n/4 Byzantine failures for
any system size, whereas boostFPP approaches this degree of Byzantine resilience only for very
large n. If, on the other hand, load is more crucial, then Threshold suffers in load whereas
boostFPP offers reduced load, as do the other three systems in this paper, albeit with lower
masking ability. If masking fewer Byzantine server failures is allowable, then other quorum
constructions can be used, in particular RT and M-Path. These two constructions have similar
masking ability, resilience, and load, but M-Path has asymptotically superior crash probability
when p is close to 1/2.

Finally, we note that it is impossible to achieve optimal resilience and load simultaneously:
Since necessarily $f < c(\mathcal{Q})$, Theorem 4.1 implies that $f < n\mathcal{L}(\mathcal{Q})$,
i.e., when load is low then so is resilience, and when resilience is high then so is load. In
order to break this tradeoff, in [MRWW98] we propose relaxing the intersection property of
masking quorum systems, so that "quorums" chosen according to a specific strategy intersect each
other in enough correct servers to maintain correctness of the system with a high probability.
Appendix



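A Combinatorial Lemmas

Lemma A.1 Let $0 \le i, d \le k$ be integers. Then
$\binom{k}{d+i} \Big/ \binom{k}{d} \le \binom{k-d}{i}$.

Proof:

$$\frac{\binom{k}{d+i}}{\binom{k}{d}} = \frac{k!\,d!\,(k-d)!}{(d+i)!\,(k-d-i)!\,k!}
  = \frac{(k-d)!}{(k-d-i)!} \cdot \frac{d!}{(d+i)!}
  \le \frac{(k-d)!}{(k-d-i)!\,i!} = \binom{k-d}{i}. \qquad \Box$$

Lemma A.2 Let $0 \le d \le k$ be integers and let $p \in [0,1]$. Then

$$\sum_{j=d}^{k} \binom{k}{j} p^j (1-p)^{k-j} \le \binom{k}{d} p^d.$$

Proof:

$$\sum_{j=d}^{k} \binom{k}{j} p^j (1-p)^{k-j}
  = \binom{k}{d} p^d \sum_{j=d}^{k} \frac{\binom{k}{j}}{\binom{k}{d}}\, p^{j-d} (1-p)^{k-j},$$

so it suffices to show that the last sum is $\le 1$. But using Lemma A.1 we get

$$\sum_{i=0}^{k-d} \frac{\binom{k}{d+i}}{\binom{k}{d}}\, p^{i} (1-p)^{k-d-i}
  \le \sum_{i=0}^{k-d} \binom{k-d}{i} p^{i} (1-p)^{k-d-i}
  = [p + (1-p)]^{k-d} = 1. \qquad \Box$$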
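
Both lemmas can be sanity-checked numerically on small parameters. The sketch below assumes the lemmas read as $\binom{k}{d+i} \le \binom{k}{d}\binom{k-d}{i}$ (for $d + i \le k$) and $\sum_{j \ge d} \binom{k}{j} p^j (1-p)^{k-j} \le \binom{k}{d} p^d$; it is a spot check, not a proof, and the function names are ours.

```python
from math import comb

def lemma_a1_holds(k, d, i):
    """Check C(k, d+i) <= C(k, d) * C(k-d, i), the cross-multiplied form of Lemma A.1."""
    return comb(k, d + i) <= comb(k, d) * comb(k - d, i)

def binomial_tail(k, d, p):
    """P(Bin(k, p) >= d), computed term by term."""
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(d, k + 1))

def lemma_a2_holds(k, d, p, tol=1e-12):
    """Check the tail bound P(Bin(k, p) >= d) <= C(k, d) * p^d up to rounding error."""
    return binomial_tail(k, d, p) <= comb(k, d) * p**d + tol

ok_a1 = all(lemma_a1_holds(k, d, i)
            for k in range(0, 15) for d in range(0, k + 1) for i in range(0, k - d + 1))
ok_a2 = all(lemma_a2_holds(k, d, p / 20)
            for k in range(0, 15) for d in range(0, k + 1) for p in range(0, 21))
print(ok_a1, ok_a2)
```

The bound of Lemma A.2 is tight at d = k, where both sides equal $p^k$.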



B Theorems of Percolation Theory

In this section we list the definitions and results that are used in our analysis of the M-Path
system, following [Kes82, Gri89].

The percolation model we are interested in is as follows. Let Z be the graph of the (infinite)
triangle lattice in the plane. Assume that a vertex is closed with probability p and open with
probability 1 - p, independently of other vertices. This model is known as site percolation on
the triangle lattice. Another natural model, which plays a minor role in our work, is the bond
percolation model. In it the edges are closed with probability p.

A key idea in percolation theory is that there exists a critical probability, $p_c$, such that
graphs with $p < p_c$ exhibit qualitatively different properties than graphs with $p > p_c$. For
example, Z with $p < p_c$ has a single connected (open) component of infinite size. When
$p > p_c$ there is no such component. For site percolation on the triangle lattice
$p_c = 1/2$ [Kes80].

The following theorem shows that when the probability p for a closed vertex is below the
critical probability, the probability of having long open paths tends to 1 exponentially fast.
Recall that LR is the event "there exists an open LR path in the $\sqrt{n} \times \sqrt{n}$ grid".
Then [Men86] (see also [Gri89] p. 287) implies

Theorem B.1 If p < 1/2 then $\mathbb{P}_p(LR) \ge 1 - e^{-\psi(p)\sqrt{n}}$, for some
$\psi(p) > 0$ independent of n.

Remark: The dependence of $\psi$ on p is such that $\psi(p) \to 0$ when $p \to 1/2$. However, for
p's not too close to 1/2 we can obtain concrete estimates using elementary techniques. For
instance, a counting argument similar to that of Bazzi [Baz96] shows that



when p < 1/3. 
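
The behavior described by Theorem B.1 can also be observed empirically. The following Monte Carlo sketch estimates $\mathbb{P}_p(LR)$ on an m x m triangulated grid by breadth-first search over open vertices. It is only an illustration: the indexing (0-based, with the left side taken to be i = 0 and the right side i = m - 1), the function names, and the seeding are our choices.

```python
import random
from collections import deque

# Neighbour offsets of the triangulated grid: the two axis directions plus one diagonal.
STEPS = ((0, 1), (0, -1), (1, 0), (-1, 0), (-1, 1), (1, -1))

def has_open_lr_path(is_open, m):
    """True if open vertices connect the side i = 0 to the side i = m - 1."""
    frontier = deque((0, j) for j in range(m) if is_open[0][j])
    seen = set(frontier)
    while frontier:
        i, j = frontier.popleft()
        if i == m - 1:
            return True
        for di, dj in STEPS:
            a, c = i + di, j + dj
            if 0 <= a < m and 0 <= c < m and is_open[a][c] and (a, c) not in seen:
                seen.add((a, c))
                frontier.append((a, c))
    return False

def estimate_lr_probability(m, p, trials, seed=0):
    """Fraction of random configurations (each vertex closed with probability p) with an open LR path."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        is_open = [[rng.random() >= p for _ in range(m)] for _ in range(m)]
        hits += has_open_lr_path(is_open, m)
    return hits / trials

for p in (0.2, 0.4, 0.6):
    print(p, estimate_lr_probability(32, p, trials=200))
```

Running it for increasing m at a fixed p below 1/2 should show the estimate approaching 1, while for p above 1/2 it should fall toward 0.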

Definition B.2 Let $\mathcal{E}$ be an event defined in the percolation model. Then the interior
of $\mathcal{E}$ with depth r, denoted $I_r(\mathcal{E})$, is the set of all configurations in
$\mathcal{E}$ which are still in $\mathcal{E}$ even if we perturb the states of up to r vertices.

We may think of $I_r(\mathcal{E})$ as the event that $\mathcal{E}$ occurs and is 'stable' with
respect to changes in the states of r or fewer vertices. The definition is useful to us in the
following situation. If LR is the event "there exists an open left-right path in a rectangle D",
then it follows that $I_r(LR)$ is the event "there are at least r + 1 disjoint open left-right
paths in D".



Theorem B.3 [ACC+83] Let $\mathcal{E}$ be an increasing event and let r be a positive integer.
Then

$$1 - \mathbb{P}_p(I_r(\mathcal{E})) \le \left(\frac{1-p}{p'-p}\right)^{r}
  \big[1 - \mathbb{P}_{p'}(\mathcal{E})\big]$$

whenever $0 \le p < p' \le 1$.

The theorem amounts to the assertion that if $\mathcal{E}$ is likely to occur when the crash
probability is p', then $I_r(\mathcal{E})$ is likely to occur when the crash probability p is
smaller than p'.